# Multimodal Large Language Model

## SAIL 7B
ByteDance-Seed · Apache-2.0 · Image-to-Text, Transformers

SAIL is a single-Transformer Multimodal Large Language Model (MLLM) designed for vision and language: raw pixel encoding and language decoding are integrated within one unified architecture.
## InternVL3 2B AWQ
OpenGVLab · Other license · Transformers

InternVL3-2B is an advanced Multimodal Large Language Model (MLLM) developed by OpenGVLab, with strong multimodal perception and reasoning capabilities. It supports tool use, GUI agents, industrial image analysis, 3D visual perception, and more.
## InternVL3 1B
FriendliAI · Other license · Transformers

InternVL3-1B is the 1B-parameter multimodal large language model in the InternVL3 series. It combines the InternViT vision encoder with the Qwen2.5 language model and offers strong multimodal perception and reasoning capabilities.
## Ovis2 1B Dev
Isotr0py · Apache-2.0 · Image-to-Text, Transformers, Multilingual

Ovis2-1B is the latest member of the Ovis series of multimodal large language models (MLLMs). It focuses on structural alignment of vision and text embeddings, and offers strong performance for its small size, enhanced reasoning, video and multi-image processing, and improved multilingual OCR.
## Video R1 7B
Video-R1 · Apache-2.0 · Video-to-Text, Transformers, English

Video-R1-7B is a multimodal large language model built on Qwen2.5-VL-7B-Instruct and optimized for video reasoning: it understands video content and answers questions about it.
## Finedefics
StevenHH2000 · Image-to-Text

Finedefics is an open-source multimodal large language model (MLLM) that improves fine-grained visual recognition (FGVR) by incorporating object attribute descriptions.
## VideoRefer 7B Stage2.5
DAMO-NLP-SG · Apache-2.0 · Video-to-Text, Transformers, English

VideoRefer-7B is a multimodal model built on a video large language model, focused on spatio-temporal object understanding.
## p-MoD LLaVA-NeXT 7B
MCG-NJU · Apache-2.0 · Image-to-Text

p-MoD is a Mixture-of-Depths multimodal large language model built with the progressive ratio decay method, supporting image-to-text generation tasks.
## Eagle X5 7B
NVEagle · Image-to-Text, Transformers

Eagle is a series of vision-centric, high-resolution multimodal large language models supporting input resolutions of 1K and above; it excels at tasks such as optical character recognition and document understanding.
## M3D LaMed Llama 2 7B
GoodBaiBai88 · Apache-2.0 · Image-to-Text, Transformers

M3D is a 3D medical image analysis framework based on multimodal large language models, comprising the M3D-Data dataset, the M3D-LaMed model, and the M3D-Bench evaluation benchmark.
© 2025 AIbase